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Abstract 

Selecting appropriately the proposal kernel of particle filters is an issue 
of significant importance, since a bad choice may lead to deterioration of 
the particle sample and, consequently, waste of computational power. In 
this paper we introduce a novel algorithm approximating adaptively the 
so-called optimal proposal kernel by a mixture of integrated curved ex- 
ponential distributions with logistic weights. This family of distributions 
is broad enough to be used in the presence of multi-modality or strongly 
skewed distributions. This "mixture of experts" is fitted, via Monte Carlo 
EM or online-EM methods, to the optimal kernel through minimization of 
the Kullback-Leibler divergence between the auxiliary target and instru- 
mental distributions of the particle filter. The algorithm requires only one 
optimization problem to be solved for the whole sample, as opposed to ex- 
isting methods solving one problem per particle. In addition, we illustrate 
in a simulation study how the method can be successfully applied to optimal 
filtering in nonlinear state-space models. 

Keywords: Optimal and backward kernel, Adaptive algorithms, Kullback- 
Leibler divergence, Coefficient of variation, Expectation Maximization, par- 
ticle filter 



1 Introduction 

During the last decade, sequential Monte carlo (SMC) methods have developed 
continuously from being a tool for solving rather specific problems, such as the op- 
timal filtering problem in general state-space models or simulation of rare events 
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(these topics are abundantly covered in Liu, 2001; Doucet et al., 2001; Ristic et al., 
2004; Cappe et al., 2005, and the references therein), to a tool for sampling sequen- 
tially more or less arbitrary sequences of distributions (Del Moral et al., 2006). 
The high flexibility of these methods makes it possible to use them for executing 
efficiently key steps of composite algorithms tackling very complex problems, e.g., 
as in particle Markov chain Monte Carlo (MCMC) methods for joint inference on 
static and dynamic variables (see Andrieu et al., 2010). 

However, while lots of research has been devoted to adaptation of MCMC 
methods (see especially Haario et al., 2001; Roberts and Rosenthal, 2007; Andrieu 
and Moulines, 2006; Roberts and Rosenthal, 2009), adaptive methods for SMC 
is still in its infancy. This article, which sets out from theoretical framework for 
adaptive SMC presented in Cornebise et al. (2008), is part of the effort to fill this 
gap, which is mandatory for achieving maximum efficiency and user- friendliness. 

In the SMC framework, the objective consists in approximating a sequence of 
target densities evolving sequentially by updating recursively a set of particles and 
associated importance weights. In order to describe precisely such an update, let 
v be a target probability density function over a space S = M d and suppose that 
we have at hand a weighted sample {(X^u^)}^ such that Q -1 J^jli ^jfO^j)) 
with Q = ^jLi^j' approximates J /(x)z/(x)dx for any z/-integrable function /. 
As a notational convention, we use capital letters for random variables, normal 
case for function arguments, and bold case for vectors. 

We now wish to transform (by moving the particles and adjusting accordingly 
the importance weights) the sample {(X^u^)}^ into a new weighted particle 
sample approximating some probability density fi on another space S given by 

J J z/(x)Z(x,x) dxdx 

where I is an un-normalized transition density. Here S can either be equal to B 
or be a space of different dimension. Having access to {(X^u^)}^, an approx- 
imation of ji is naturally obtained by plugging the weighted empirical measure 
associated with this weighted sample into (1.1), yielding the mixture density 

X^x^KX^x) _ / ^//(X^xQdx \ / Z(X i; x) 

* Eti^(X;,x)dx " {eU^IK^)^) VW,x)dx, 

(1.2) 

and ideally an updated particle sample would be obtained by drawing new par- 
ticles {X^}^ x independently from (1.2). There are however two problems with 
this approach: firstly, in general, the integral J /(•, x) dx lacks closed- form expres- 
sion, ruling out direct computation of the mixture weights; secondly, even if the 
integral was known on closed-form, the resulting algorithm would have an 0(N 2 ) 
computational complexity due to the normalization of the mixture weights. To 
cope with these issues, we will, as proposed by Pitt and Shephard (1999), aim at 
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sampling instead the auxiliary target distribution 

^«(i,x) = N = T(X^x) , (1.3) 

Ei=i ^ J ^( x ^ x ) dx E,=i ^ a *( x j) 

over the product space {1, . . . , TV} x S of indices and particle positions, where 
we have introduced the partition function a*(x) := J /(x, x) dx of /(x, •) and the 
normalized (Markovian) transition density /*(x,x) := /(x, x)/ / /(x, x) dx. To 
sample (1.3), we take an importance sampling approach and draw independent 
pairs {(JjjXi)}^ of indices and positions from the proposal distribution 



^(X^) 

Ef=i^ fl ( x i) 



tt(i,x) := _^ V ^_ r(X,,x) (1.4) 



over the same extended space and assign each draw X^) the importance weight 
cDi := i/;(/i,Xi), where 

w(z,x) := — oc 



a(X i )r(X ij x) tt(z,x) 

Finally, we discard the indices 1{ and keep {(X^c^)}^ as an approximation 
of \i. This scheme, which is traditionally referred to as the auxiliary particle 
filter^ encompasses the simpler framework of the bootstrap particle filter proposed 
originally by Gordon et al. (1993), which simply amounts to setting a(x) = 1 
for all x (implying a gain of simplicity at the price of flexibility). Moreover, 
SMC schemes where resampling is performed only at random times can similarly 
be cast into the setting of the auxiliary particle filter by composing the kernels 
involved in several consecutive steps of the of recursion (1.1) (see e.g. Cornebise, 
2009, Chapter 4, for details). Therefore, any methodology built for the auxiliary 
particle filter also applies to these simpler frameworks without modification. 

The choice of the nonnegative adjustment multiplier function a and the pro- 
posal kernel r (which is supposed to dominate /*) effects significantly the quality 
of the returned sample, and in this paper we focus on adaptive design of the latter. 
Doucet et al. (2000) suggest to approximate each function Z(Xj, •) by a Gaussian 
density whose mean and covariance are obtained using the extended Kalman fil- 
ter (EKF). This technique is computationally heavy for large sample sizes since 
it requires to solve an optimization problem for each particle x^. A somewhat 
simpler version, consisting in replacing the EKF by the unscented Kalman filter 
(UKF), was studied later by Van der Merwe et al. (2000); Van der Merwe and 
Wan (2003). Pitt and Shephard (1999) use a form of Laplace approximations 
for approximating /(X^, •) by simply centering a Gaussian density or the density 
of a student's i-distribution around the mode of the function in question. This 
technique is appropriate when the function is log-concave (or strongly unimodal). 
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Nevertheless, the localization of the mode requires solving an optimization prob- 
lem for each particle. Thus, in spite of this intense recent activity in the field, 
the state-of-the-art algorithms have met only mitigated success as they implicitly 
assume that the functions Z(Xj, •) have a single mode. 

A common practice (see e.g. Oh and Berger, 1993) for designing proposal 
distributions in standard (non-sequential) importance sampling is to consider a 
parametric family of proposal distributions and then identify a parameter that 
minimizes some measure of discrepancy between the target and the proposal dis- 
tributions. Common choices are the widely used squared coefficient of variation 
(CV) 

N / ~ n 2 

1=1 x / 

of the importance weights or the negated Shannon entropy 

1=1 x ' 

of the same. Both are maximal when the sample is completely degenerated, i.e. 
when one particle carries all the weight, and minimal when all the importance 
weights are equal. Note that the coefficient of variation often appears in a trans- 
formed form as the effective sample size given by ESS = N/(l + CV 2 ). 

The same quantities have been widely used also within the framework of SMC, 
but until recently only for triggering the resampling operation and not as a tool 
for adaptive design of the instrumental distribution. The key result of Cornebise 
et al. (2008) was to relate the Shannon entropy to the Kullback-Leibler diver- 
gence (KLD, recalled below) between the instrumental and proposal distributions 
of the particle filter. In addition, a similar relation between the CV of the parti- 
cle weights and the Chi-square divergence (CSD) between the same distributions 
was established. More specifically, as the number of particles tend to infinity, the 
CV and the entropy tend to the CSD resp. the KLD between the asymptotic 
target distribution /x*(x, x) oc /(x, x)z/(x) and the asymptotic proposal distribu- 
tion 7r*(x, x) oc a(x)r(x, x)z/(x) on the product space X x X, i.e. under weak 
integr ability hypotheses, as TV — »> oo, 

S A d KL (/X*|| 71"*) , CV 2 A d x 2 ((JL*\\ TT*) , 

rfKL(M aUX ||7T) ^d K L(M*||7T*) , rf x 2(/X aUX ||7r) ^d x 2(/X*||7T*) . 

This gives a sound theoretical support for using the CV and the entropy of the 
importance weights for measuring the quality of the particle sample. In addition, 
it suggests that the KLD and the CSD between /x aux and n could be used in lieu 
of £ and CV 2 , respectively, for all purposes, especially adaptation. As a matter of 
fact, in the context of adaptive design of SMC methods, the KLD is (as pointed 
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out in Cornebise et al, 2008, Section 2.3) highly practical since it decouples the 
adaptation of the adjustment weight function a and that of the proposal kernel r. 
Recall that the KLD between two distributions with densities p and q over some 
space S is defined as 



d 



KL 



jMS) 



p(x) dx , 



provided that the support of p is included in the support of q. Replacing p and q 
by the auxiliary target distribution /i aux and the proposal 7r, respectively, yields 



(Ikl (p au *\\n) = E, 



log 



IE^aux 


log- 







/i aux (/X) 
vr(/X) 

r(x 7 ,x) 

r(X,,X) 



;i.5) 



— 1~ IE^aux 



log 



w / o(Xj)/X)5LiW i o(X i 



Depends only on r 



Depends only on a 



;i.6) 



The first term corresponds to the discrepancy induced by proposing the particles 
X with a suboptimal proposal kernel, and the second term corresponds to the 
discrepancy induced by choosing the ancestor indices I according to suboptimal 
adjustment weights. Moreover, rf K L (/x aux ||7r) equals 



logr(X / ,X) 



IE^aux 



log 



a(Xj) 



(1.7) 



up to additive terms involving the optimal quantities a* and I* only and thus 
being irrelevant for the optimization problem (equality up to a constant will in the 
following be denoted by =). Restricting ourselves to the adaptation of the proposal 



k) = -E fJ 



log r(Xj,X) 



kernel, we obtain the simple expression rf K L 
Although the d^L is most often not directly computable, such a quantity can 
be approximated on-the-fly using the weighted sample obtained; the resulting 
algorithm closely resembles the cross-entropy (CE; see Rubinstein and Kroese, 
2004) method. 

Henceforth, in the present article, we select the auxiliary proposal distri- 
bution 7r from a family {ttq : G 0} of candidates by first solving 0* := 
argmin^ G0 rf K L(M aux |K^) and then letting tt = no*. On the one hand, the chosen 
parametric family should be flexible enough to approximate complex transition 
kernels. On the other hand, it should be simple enough so that sampling from 
ttq is easy. Finally, the parameterization should be done in such a way that the 
problem of estimating the parameters is as simple as possible. In this article, we 
suggest modeling the proposal ttq as a mixture of integrated curved exponential 
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distributions. The mixture weights are chosen to be dependent on the ancestor 
particle Xj, allowing for partitioning of the input space into regions correspond- 
ing to a specialized kernel. Each component of the mixture belongs to a family 
of integrated curved exponential distributions, whose two most known members 
are the multivariate Gaussian distribution and the student's ^-distribution. Also 
the parameters of the mixture depend on the particle positions Xj. This pa- 
rameterization of the proposal distribution is closely related to the (hierarchical) 
mixture of experts appearing in the machine learning community and described in 
(Jordan and Jacobs, 1994; Jordan and Xu, 1995). The flexibility of our approach 
allows for efficient approximation of the optimal kernel for a large class of intricate 
(nonlinear /non-Gaussian) models. 

The paper is organized as follows. The mixture family under consideration is 
described in Section 2 and in Section 3 we treat optimization of the parameters. 
In Section 4, two versions of the adaptive particle filters are presented and in 
the last part, Section 5, we illustrate the efficiency of the method on several 
simulation-based examples. 



2 Mixture of experts 

In this contribution, we let the proposal kernel have density 



/"a (x,x) := ^a j (x,^)p(x,x;r7 j ) , (2.1) 



where the functions {&j}j =1 are so-called weighting functions and p(-,-;?7 J ) are 
Markov transition kernels from S to H. Denote := (/3, 77) with 77 = (77^ . . . , r/ d ). 
In the auxiliary particle filter framework, the associated proposal distribution is 
defined as 



• (2.2) 



We then assign the importance weight uf = w 6 (I i) Xj) to each draw X$) from 
7T0, where 

w o (i x) = l S^l3 oc gfog (2 3) 

a(X i )Y, d j=1 a j (X i ,0p(X i ,5t-,ri j ) Mh*) 

The weighting function partitions the input space S into regions which are as- 
sociated to a few specialized transition kernels in the mixture. This is done by 
assigning a vector of mixture weights to each point of the input space. As in Jor- 
dan and Xu (1995), we consider logistic weight functions, i.e. (3 = (/3 l3 . . . ,^-1) 
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and 

:= i^^i ( ^ X L-n - l<J<d-l 5 (2-4) 

1 + Ejb=i ex P (A x ) 

where x := (x, 1) and /3 •, 1 < j ' < rf — 1, are vectors in K p+1 . 

In some cases it is of interest to resort to a simpler mixtures whose weights do 
not depend on x by letting, for (3 G such that EjLi fij = ^ ^ (x, /3) := 
1 < j < independently of x and hence without partitioning the input space. 
This model for the transition kernel is then similar to the switching regression 
model (Quandt and Ramsey, 1972). 

To ease the implementation — see Section 3 — the kernels p (x, x; 77) are assumed 
to have the following integrated curved exponential form: 



p(x,x;77)= / p e (x, x, u; 77) du , (2.6) 
Ju 

where p e (x, x, u; 77) is a curved exponential distribution 

p e (x, x, u; 77) = 7 (u)/i(x, x, u) exp (-A(r,) + Tr (s (x, x, u) T J B( ?7 )) ) . (2.7) 

In other words, p (x, x; 77) corresponds to the marginal (in (x, x)) of a curved 
exponential distribution p e (x, x, u; 77), where u is an auxiliary variable taking 
values in some space U and 7 is the marginal distribution of u. In (2.7), S is the 
vector of sufficient statistics. 

We now augment (/, X) with the index J of the mixture component as well as 
the auxiliary variable U of the curved exponential family. This extended auxiliary 
variable (/, J, X, U) is distributed according to 

7rg(i,j,x,u) := — P^y- - aj(Xi,0)p e (X*,x, u; 17.) (2.8) 

no(hj>x> u ) '•= - ajiXitPjp* (X^x^u;^) 

on the product space {1, . . . , TV} x{l,...,rf}xSxU. It is easily checked that 
ttq is the marginal density of 7rg in I and X. 

Efficient fitting of the optimal parameter 0* minimizing the KLD (1.7) is 
the topic of the Section 3. We will illustrate the details for the specific case of 
multidimensional Gaussian distributions and multidimensional i-distributions. It 
should however be kept in mind that the algorithm is valid for any member of the 
family of integrated curved exponential distributions as long as Assumption 3.1 
below is fulfilled. 
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3 Parameter estimation techniques 



Parameter estimation for mixtures of experts is most often carried through us- 
ing the EM algorithm (see McLachlan and Krishnan, 2008, Section 8.4.6) and 
also in this paper a recursive EM-type algorithm will be used for finding opti- 
mal mixture parameters. In order to describe precisely the algorithm, let 6 l ~ l 
be a given parameter estimate, obtained after t — 1 iterations of the scheme. At 
iteration I, a new fit d e of the parameter is obtained as the argument 6 maximiz- 
ing the intermediate quantity Q(0,6 e ~ 1 ) := E^auxpE^-i [logng(I, J, X, U)|J, X]], 
where 7r| is the extended auxiliary distribution in (2.8) and E e-i denotes expec- 
tation w.r.t. the distribution ^ e i-i- The intermediate quantity may be decomposed 
- Qi(P, O"- 1 ) + Q 2 (T7, 0"- 1 ), where 



as g(6>,^" 1 ) 



Q 2 (ri,0 e - 1 ) :=E M a UX 



E^-i 



log a j (X j, /3) 



/,X 



logp e X^X.U;^ 



/,x 



(3.1) 
(3.2) 



Therefore, at each iteration, maximization of the intermediate quantity may be 

split into the two subproblems (3 £ = arg max^ Qi(/3, O^ 1 ) and rf = arg max^ Q 2 (^ 7 O^ 1 ). 

Under weak additional assumptions, it suffices to find a value increasing Q(0, Q^ 1 ) 
over its baseline, i.e. Q(0 £ \0 l ~ l ) > Q(^ _1 ,^ _1 ), in order to obtain a sequence 
{0 £ }i>o converging to the optimum 0*; this is of special importance when using 
logistic weights, as finding the exact optimum of the intermediate quantity could 
be expensive in that case. 



3.1 Updating of logistic weight parameters 

For the logistic weights (2.4), finding the maximum in (3 of Qi{P,0 £ ) on closed 
form is out of reach but can be performed using iteratively reweighted least squares 
(IRLS), as in (McLachlan and Krishnan, 2008, Equations (8.32) and (8.33)). 
By defining the conditional mixture weights 



^•(Xj,^)p(X i ,x;ry j ) 



e /.i. ~x 3 V 1 ' /__/ ' \ * ' ' ' 3 J / q q \ 

n e U \h x) = — d — ^ — , (3.3) 



we may express the fcth block (of size (p+ 1)) of the gradient vector \^Qi(/3, £ x ) 



as 



V^g 1 (A^- 1 )=E M a UX [{tt^-x (fc|/,x) -a fc (Xj,^)}Xj] (3.4) 
for any k G {1, . . . , d — 1}. Similarly, for any (k,m) G {1, . . . , d— l} 2 , block (k, m) 
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(of size (p + 1) x (p+ 1)) of the Hessian matrix £ ) can be expressed as 



V^V^Q 1 (A^" 1 )=E, 



a fc (X 7 ,/3) <^« m (X 7 ,/3) 



1 



{/c=m} 



l + Eti exp (/3fX,) 



(3.5) 

where, for any vector or matrix A, A 2 is shorthand for AA T . Note that the Hes- 
sian is the conditional expectation of the complete data negated observed Fisher 
information matrix, and is therefore definite negative. These results stem directly 
from the multinomial logistic regression by computing conditional expectations. 
The EM intermediate quantity (3 H> Q 1 ((3,9 i ~ 1 ) is therefore concave. We could 
thus achieve exact optimization at each iteration I using IRLS or running the 
Newton- Raphson scheme until convergence (Chen et al., 1999, who also correct 
the erroneous expression of the Hessian matrix in IRLS), which is commonly done 
in multinomial logistic regression. However, a single Newton-Raphson step is 
enough to ensure that the intermediate quantity is increased, which implies con- 
vergence of the EM scheme. Consequently, an EM iteration may be written as 



e-i 



re 



vjQ^y- 1 )! 



VgQi&fl*- 1 )! 



(3.6) 



for some positive step size (or learning rate) T£ < 1. Following McLachlan and 
Krishnan (2008), the learning rate is set to T£ — 1. 

3.2 Updating of strata parameters 

We now turn to estimation of the strata parameters 77. 

Assumption 3.1 (Closed form of integrated sufficient statistics). In the sequel, it is 
assumed that, for any rj and any j 6 {1, . . . , rf}, the integrated sufficient statistics 

j ^(X^u^ulX^T^du (3.7) 

are available on closed form. This is for instance the case for Student's t-distributions. 

Under Assumption 3.1, for any j £ {1, . . . , d} we define the expected sufficient 
statistics of component j as 



Pj {0) • — IE^aux 


} e e (j 


j,x) 


1 


(3.8) 


S j {0) ! = IE^aux 


*l (j 


j,x) 


J S (Xj, X, u) p e (u|X 7 , X; rtj) du 


. (3.9) 



We now derive the explicit update expressions for the two common special cases of 
multivariate Gaussian and multivariate Student's i-distributions; it should how- 
ever be kept in mind that the method is valid for any integrated curved exponential 
distribution satisfying Assumption 3.1. 
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3.2.1 Updating formulas for Gaussian strata 



To each component j, assign a linear Gaussian regression parameterized by rjj = 
(Mj,Sj). Thus, given J = the new particle X has ^-dimensional Gaussian 
distribution with mean Mj X/, where the regression matrix Mj is of size p f x (p+1), 
and symmetric, positive definite covariance matrix Ej of size p r x p f . In this case, 
the sufficient statistics are 5(x,x, u) = (x 2 ,x 2 ,xx T ), leading to the expected 
sufficient statistics 

Pj (O 1 - 1 ) :=E^ 7Tq(>-i (./ /.X 

5j,2 X ) •= E^au: 



t-1 \ J 

[J 



J,X) X?' 



5,- 1 (0'- 1 ) := E p 
5 i(3 C^" 1 ) := E p 



V-i U 

TT^-x [J 



J,XJ X 2 

£) X 

(3.10) 



j,x) xxj 



In addition, the functions A and 5 in (2.7) are given by 



A(rj) = -\og\E\ , B{rj) 



--E" 1 , -iME _1 M T , -E _1 M 
2 ' 2 



Consequently, the argument 77 maximizing Q 2 (r},6 e x ) is given by = M (Sj (6^ *)) 
and E^ = E (P,^ -1 ), ^ where the functions M and E are defined by 



M (Sj) := S jt3 S~l and E (P,-, Sj) := 



(3.11) 



Remark 3.1 (Pooling the covariances) . In practice, a robustified version of the 
algorithm above can be obtained by pooling the covariances. This means that 
a common covariance matrix is used for all the components of the mixture, i.e. 
Sj = E for every j. By doing this, the well-known problem of mixture models with 
strata components degenerating to Dirac masses is avoided. It is straightforward 
to enforce such a restriction in the optimization procedure above, leading to 



1 d 



(3.12) 



k=l 



for all j. 



3.2.2 Updating formulas for Student's ^-distributed strata 

A common advice in Importance Sampling (see, for instance, Oh and Berger, 1993) 
is to replace Gaussian distributions by Student's i-distributions with location 
parameter /i, scale parameter E, and a chosen number v > of degrees of freedom. 
Therefore, as an alternative, robustified version of the Gaussian mixture model 
above, we propose using instead a ^/-dimensional Student's ^-distribution with v 
degrees of freedom, i.e. p (x, x; 77^) = t v > (x; Mj x, Ej, v). 
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Remark 3.2 (Fixing the degrees of freedom). The number v of degrees of freedom 
of the Student's i-distributions is fixed, typically to v E {3,4}, beforehand and 
is common to all the components. A similar choice has been made by, among 
others, Peel and McLachlan (2000, Section 7) and Cappe et al. (2008) as it allows 
for closed form optimization in the M-step. 

These kernels can be cast into the framework of Section 2 thanks to the 
Gaussian- Gamma decomposition of multivariate Student's i-distributions used in 
(Liu and Rubin, 1995, Section 2) and (Peel and McLachlan, 2000, Section 3), 

t (x; /x, E, v) = J M (x; /x, E/u) 7 (u; |, du , 

where 7 (u; a, 6) := 6 a u a_1 exp (— 6u) l^+(u)/T(a) is the density of the Gamma 
distribution with shape parameter a and scale parameter b. The multivariate 
Student's i-distribution kernel is hence an integrated curved exponential distri- 
bution (2.6) with 7(11) = 7(u;|,|), /i(x, x, u) = (27r) - ^ 2 , sufficient statistics 
aS(x, x, u) = (ux 2 , ux 2 , uxx T ), and the same A and B as in the Gaussian case. 
Since the Gamma distribution is a conjugate prior w.r.t. precision for the Gaus- 
sian distribution with known mean, it holds that 



[U|x,x] 



v + p' 



i/ + 5(x,Mx,E) ' 



(3.13) 



where 5 (x, /i, E) = |x — = (x — /x) T E 1 (x — /i) is the Mahalanobis dis- 

tance with covariance matrix E. This leads to the expected sufficient statistics 



E, 



E„ 



7,X 
7,X 

7,X 

7,X 







f +p' 




(X^X^e/) 




(X^X^e/) 



X 2 



T 

xx 7 



(3.14) 



Since the functions A and B are unchanged from the Gaussian case, the same 
equations (3.11) can be applied for updating the parameter 77 based on the ex- 
pected sufficient statistics (3.14). Covariance pooling is also achieved in exactly 
same way as in Remark 3.1. 
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4 Stochastic approximation and resulting algo- 
rithm 



The algorithms described so far are mainly of theoretical interest, since com- 
putation of the sufficient statistics Sj^ (0) defined in (3.9) involves computing 
expectations w.r.t. to the target distribution /x aux . We may however approximate 
the same using importance sampling with a relatively small sample size. This is 
described in the following. 

At iteration we sample a block of N £ particles from tv 6 i-i and form each 
estimate of an expected sufficient statistic Sj^ as a convex combination of the 

former estimate S^ 1 and an importance sampling approximation based on the 
obtained particle block. The weights of the convex combination are given by a 
sequence {A^}^ of positive step sizes such that Y^tLv ^ = 00 an d Y^T=v tf> < 00 > 
with A = 1 for initialization. This is essentially a Monte Carlo version of the 
generic online EM algorithm of Cappe and Moulines (2009); the main difference is 
that the integrals are w.r.t. to the target distribution instead of w.r.t. the empirical 
distribution of some observations as in classical EM. 
As showed by Douc et al. (2008), it holds that 

as N tends to infinity; therefore, relying on the weig hted sample {{if , xf)}^ 
produced at the current iteration, we use 



(1 - W" 1 + Ei>f , for£>l, 



as normalizing constant of the importance sampling estimate. We then estimate, 
for any function F, the expectation E M aux[F(0, /, X)] by 



N £ „ [£] 

F(6,I,X)\ := (1- A,)< a -i ] |>(0*-\J,X)1 +A^^F(0;/f ] ,xf ] ) . 



N e C e 

1=1 



(4.1) 

Obviously this stochastic approximation of the normalizing constant does not 
guarantee that the renormalized weights sum to 1 , and hence leads to approxima- 
tion of a probability measure with a finite measure. However, this does not not 
impede the convergence of the stochastic approximation. Moreover, the weights 
of the final sample obtained from the adapted kernel are renormalized to sum to 
one: the final approximation therefore remains a probability distribution. 

Past the initial step t = 0, which provides the first optimized fit, the numbers 
N e of particles sampled at each iteration I > 1 can be noticeably small. A 
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convergence proof of this stochastic approximation algorithm will be the topic 
of the aforementioned companion paper and is rooted in the analysis of generic 
Online EM (Cappe and Moulines, 2009), stochastic approximation EM (SAEM, 
Delyon et al, 1999; Kuhn and Lavielle, 2004; Andrieu et al, 2005) and Monte 
Carlo EM (MCEM, Fort and Moulines, 2003). In the present paper we limit 
ourselves to numerical illustrations of the proposed scheme; see the next section. 

Remark 4.1 (Proposal kernel at the initial iteration). Finally, we may choose to 
draw the particles of the first step from a distribution different from n e o in order 
to avoid relying on an arbitrary initial choice which is often quite poor. When 
operating on state-space models, as in the examples presented in the next section, 
we obtain the first sample )}^° 1 using the prior kernel — the simplest 

default choice. 



5 Applications to nonlinear state-space models 

SMC methods can be successfully applied to optimal filtering in state-space mod- 
els. A state-space model is a bivariate process (X k ,Y k ) k >o, where X = (X k ) k > 
is an unobserved Markov chain on some state-space X C M d and (Y k ) k >o is an 
observation process taking values in some space YCP. We denote by Q and 7r 
the transition density and initial distribution of X, respectively. Conditionally on 
X, the observations are assumed to be conditionally independent with the condi- 
tional distribution G of a particular Y k depending on the corresponding X k only. 
We will assume that Q and G have densities q and g, respectively, with respect 
to Lebesgue measure, i.e. 

F(X k+1 G A\X k ) = Q(X k , A) = [ q(X k , x) dx (5.1) 

J A 

and 

F(Y k G B\X k ) = G(X k , B)= f g(X k , y) dy . (5.2) 

JB 

For simplicity, we have assumed that the Markov chain X is time-homogeneous 
and that the distribution G does not depend on k; however, all developments 
that follow may be extended straightforwardly to the time-inhomogeneous case. 
The optimal filtering problem consists of computing, given a fixed record Y 0:n := 
(Yo, . . . ,Y n ) of observations, the filter distributions (j) k {A) := ¥(X k G A|1q:/c) f° r 
k = 0, 1, . . . , n. As the filtering distributions satisfy the nonlinear recursion 

^/ x / fffofc+i, yfc+i)g(sfc, x k+1 )(j) k {x k ) dx k 

k fc+i j j g ( Xk+1 ^ y kJrl ) q [ Xk) x k+1 )(j) k (x k ) dx k dx k +i ' 

the optimal filtering problem can be perfectly cast into the sequential impor- 
tance sampling framework (1.1) with H = H = X, u = (j) k) /a = 0/e+i, and 
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/(x,x) = gf(x,i/ife)g(x,x). Consequently, a particle sample {(X^u^)}^ approx- 
imating (f) k can be transformed into a sample (X^c^)^ approximating the fil- 
ter (j)k+i at the subsequent time-step by (1) drawing indices {h}f=i multino- 
mially with weights proportional to (a^)^, (2) simulating particle positions 
~ r(X^, •), and (3) assigning each updated particle X^ the importance weight 
uji = g(X i ,y k+1 )q(X Ii ,X i )/(a Ii r(X Ii ,X i )). Here {ai)f =1 is a set of nonnegative 
adjustment multipliers and the proposal r is a Markov transition density. In the 
following examples we illustrate how the proposal r can be designed adaptively 
using mixtures of experts. 

5.1 Multivariate linear Gaussian model 

We start by considering a simple multivariate linear Gaussian model. In this toy 
example, the optimal kernel L* and the optimal adjustment weight function a* 
are available on closed form. This makes it possible to compare our algorithm to 
an exact reference. The optimal kernel does not belong to our family of mixture 
of experts. In the model under consideration: 

• The distribution v on X = E 2 is bimodal with density i/(x) = 0.5A/" 2 (x; (0, 1) T , E) + 
0.5A/" 2 (x; (0, -1) T , E), where E = 0.1/ 2 - The two modes are hence well sep- 
arated. 

• The prior kernel density g(x, x) = 0.5A/2 (x; Ai x, E) + 0.5A/2 (x; A 2 x, E) is 
a mixture of multivariate Gaussian distributions with regression matrices 

a /l 1\ A / 1 1 \ 

Al= (o i i ) ' A2= U 1 -1 ) ' 

• Each y^, taking values in Y = ]R 2 , is a noisy observation of the corresponding 
hidden state X k with local likelihood #(x, y) = A2 (?/; x, Sy), where Ey = 
0.1/ 2 . 

In this first example, we consider one single step of the particle filter, with initial 
particles {X^}^ x sampled exactly from the initial distribution. The observation 
Y k is chosen to be Y k = (1,0) T = A 2 (0, 1) T = Ai(0, -1) T . The ancestor of the 
hidden state around which this observation is centered is equally likely to belong 
to any of the two components of the original distribution, depending on which 
(unknown) component of the prior kernel was used to sample the new state. Even 
though the optimal kernel L* does not belong to our family of experts, it is still 
a mixture whose weights are highly dependent of location of the ancestor, and we 
can expect our algorithm to adapt accordingly. 

The histogram in Figure la of the importance weights of the particle swarm 
produced using the prior kernel shows that 50% of these weights are nearly equal 
to zero, and the remaining ones are spread over a large range of values. This 
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is confirmed by looking at the corresponding curve of proportions in Figure lb, 
showing that 80% of the total mass is carried by only 25% of the particles and 
99% of the total mass by 40% of the particles. 
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Figure 1 : Evolution of the distribution 
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One single iteration of our SA-based adaptation scheme described in Section 4 
is sufficient to improve those proportions to 80% of the mass for 40% of the 
particles and 99% of the mass for 55% of the particles; see the corresponding 
curve of proportion in Figure lb. After 10 such iterations, the corresponding 
curve of proportions in Figure lb shows that close to maximum improvement 
has been reached: the first few steps of the optimization are enough to bring 
significant improvement. The histogram in Figure la of the weights obtained at 
the final iteration shows how the weights are concentrated around unity, the value 
that corresponds to sampling with the optimal kernel. 

We display in Figure 2a the KLD (1.6) between the fit and the target (esti- 
mated by means of a Monte Carlo approximation with a large sample size), and 
the KLD for the proposal with prior kernel and for the proposal with optimal 
kernel with uniform adjustment weights — the optimal adjustment weights are 
near uniform in this example. As mentioned earlier, the KLD decreases very fast, 
most of the improvement being obtained in the first few iterations. Figure 2b 
compares the evolution of the KLD for several step sizes covering four order of 
magnitude. As the step size is increased, the algorithm converges faster, although 
less smoothly: this has no practical impact as we are only looking for a good 
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proposal kernel in an importance sampling setting, not for the exact optimal one. 
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Figure 2: Evolution of the KLD over L = 20 iterations of adaptation for the LG 
model. 



5.2 Bessel process observed in noise 

As a more intricate example, we consider here optimal filtering of the Brownian 
motion underlying a Bessel process observed in noise, also known as range-only 
filtering of a Gaussian random walk. We let the state process evolve in the plane 
for visualization purposes. More specifically, the state-space model is given by 

Xk+i — Xk + Vk , 
Y k = \\X k \\+W k , 

where A/*2 (0, E^) and Wk ^ A/"(0,a 2 ) are mutually independent, and 

11^ II = ^Y^i=i ^(i) i s L 2 norm on the state space X := IR 2 . With the notations 
of (5.1) and (5.2), we define by g(x,x) := N 2 (x;x, E x ) the density of the prior 
kernel and by g(x, y) := N (y; ||x|| , a 2 ) the local likelihood. We ensure a diffuse 
prior kernel and informative observations by setting E x = I 2 and variance a 2 = 
0.01. As the hidden state is observed range-only, the state equation provides most 
of the information about the bearing while the local likelihood is invariant under 
rotation of the state around the origin. This induces a variety of nonlinear shapes 
of the optimal kernel depending on the location of the ancestor — see the three top 
rows of Figure 3. 

The original weighted sample {(X^, uJi)}f =l consists of a sample of = 20, 000 
i.i.d. particles drawn from the prior distribution v = Af 2 ((0.7, 0.7), 0. 5/2) of the 
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state at time (and hence having uniform weights). The first observation is 
Y\ = 1.0. We initialize the algorithm with N\ = 1, 000 particles {(X^, using 
the prior kernel. The resulting cloud, along with the 100 particles with the largest 
weights and their ancestors, is plotted in Figure 4 (top row). The support of the 
proposal distribution is over-spread: most of the particles have negligible weights, 
and a few particles have comparatively large weights: 80% of particles have null 
weight. This is confirmed by the curve of proportions in Figure 5 (left): only 20% 
of the proposed particles carry the total mass. Adaptation of the proposal kernel 
is thus relevant. Here again, adaptation of the adjustment weights is not required, 
as the ancestors of the particles with highest importance weights are not located 
in any specific region of the space. 

Based on these 1,000 particles, the first iteration of the adaptation is done 
using conditional probabilities from the initial fit displayed in Figure 3 (fourth 
row) whose components are chosen to be independent of the ancestor, i.e. only 
the constant term is non-null. The resulting kernel is plotted in Figure 3 (fifth 
row). We use it to propose 200 new particles, serving as importance sampling 
approximation of /x aux at iteration 2. After 30 such iterations, the adapted kernel 
visible in Figure 3 (last row) is visually very close to the optimal kernel. Note 
the impact of the location of the ancestor on the un-normalized transition kernel, 
whose mass shifts on a circle, and how the adapted kernel shifts accordingly. 
Figure 4 (bottom row) shows that adaptation concentrates the particles on the 
support of the target distribution, thus narrowing the span of the importance 
weights and balancing the contributions of all particles. 

Most importantly, a very small number of iterations suffices to achieve sig- 
nificantly more uniformly distributed importance weights, i.e. significantly lower 
KLD: Figure 5 (right) shows how the KLD drops after the first 2 iterations and 
then stabilizes near null, while the curve of proportions in Figure 5 (left) shows 
that the distribution of the weights is essentially unchanged past the first few 
iterations. 

As a final look into the convergence, Figure 6 displays the evolution of all the 
estimated parameters over the 30 iterations, confirming that the fit stabilizes after 
a few steps: the result of the very first couple of iterations could serve as a more 
efficient proposal than the prior kernel. The parameters of the logistic weights /3 £ 
are the slowest to stabilize due to the stochastic gradient used to palliate for the 
lack of closed- form update as mentioned in Section 3. 

We now iterate the same procedure over 50 time-steps and compare the adap- 
tive filter with the plain bootstrap filter. The forgetting of the initial distribution 
of the random walk entails that the filter distribution flow converges to a distribu- 
tion symmetric under rotation and supported simply on a ring centered at (0, 0). 
To connect with widespread measures of efficiency discussed in Cornebise et al. 
(2008), we compare in Figure 7 the negated Shannon entropy (lower is better) 
of the importance weights and relative effective sample size (in percentage of the 
total sample; higher is better) of the two filters. The negated Shannon entropy 
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Figure 3: Visual evolution of the target and adapted kernels in the Bessel model, 
for 3 distinct ancestors in different regions, for observation y — 1.0. 
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Figure 4: Comparison of 1, 000 and 200 particles from the prior and final adapted 
kernels, respectively, with the 10% of largest weights plotted in red, along with 
their ancestors and the distribution of their importance weights. 




Figure 5: Improvement of the adaption over 30 iterations. The proportions plot 
(left) and the KLD (right) dramatically improve over the very first iterations. 
While the prior kernel puts 90% of the mass on 15% of the particles, adaptation 
increases it to 70% of the particles in one step and stabilizes to 80% after a very 
few iterations. 
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Figure 6: Parameters Q i of the adapted kernel over 30 iterations of the algorithm: 
practically, convergence is achieved after a few steps only. 

shows improvement by our adaptive filter, which is not surprising in the light of 
the convergence results of (Cornebise et al., 2008, Theorem 1) recalled in Section 1: 
it converges to the same KLD between the same distributions when the number 
of particles tends to infinity. More remarkably, in this example, the relative effec- 
tive sample size also shows an improvement, while the same convergence results 
prove it to be linked to the chi-square divergence instead of the KLD. While a 
first intuition would make sense of improving the CSD by minimizing the KLD, 
this is not systematically the case. 
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Figure 7: Evolution of the effective sample size and the entropy between target 
and proposal for the bootstrap filter and the adaptive filter. As intended, the 
adaptive filter leads to higher effective sample size and lower negated Shannon 
entropy. 
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5.3 Multivariate tobit model 



We now briefly illustrate how far adaptation of the proposal kernel goes — and 
where it stops. Consider a partially observed multivariate dynamic tobit model 

X k = AXk-i + Uk , 
Y k = m&x(B T X k + V kl 0) , 

where X = M 2 , A is a 2 x 2 invertible matrix, and B = (1,1) G I 2 , so that Y k 
takes values in R. The random variables U k and V k are Gaussian white noises 
with variances and respectively. The observation process consists of noisy 
observations of the sum of its components of the hidden states. In addition, 
the observations are left-censored. We have an original swarm of TV = 20, 000 
particles distributed according to X G ~ J\f 2 ((1, 1) T , IO/2) and we set Y\ — 0. The 
local likelihood is hence null above the line A = {x e K 2 : B T x = 0}, and 
constant below, with a narrow transition region, as displayed in Figure 8 (second 
row) . The prior kernel displayed in Figure 8 (first row) can have most of its mass 
out of the high-likelihood regions, depending on the ancestor. Figure 8 (third 
row) illustrates the un-normalized optimal transition kernel for three reference 
ancestors, showing how the match between the supports of the proposal kernel 
and of the local likelihood varies depending on the position of the ancestor particle 
relatively to the line A -1 A. Hence half of the original particles have very small 
adjustment multiplier weights. 

Adapting the proposal kernel will hence require at least two components, and 
will not lead to perfect adaptation, as can be seen on the fit after 10 iterations 
displayed in Figure 8 (last row). The ancestor (1,1) is the center of the ancestor 
sample, un-normalized optimal kernel is the prior kernel truncated in its middle; 
(—3,-1) is the bottom left of the ancestors sample, the un-normalized optimal 
kernel almost matches the prior, save for a truncation in the upper right tail; (9, 5) 
is the top right of the ancestor sample, the un-normalized optimal kernel differs 
widely from the prior kernel, as only the very far tails of the latter match non-null 
local likelihood. 

Finally, without getting into details, we make our point through Figure 9 show- 
ing again how the KLD drops in the first iteration for families of both Gaussian 
distributions and Student's ^-distributions. It then stabilizes close to the mini- 
mum achieved by the optimal kernel with uniform adjustment weights, which is 
not negligible. The Student's ^-distribution allows heavier tails, at the price of a 
higher attainable lower bound on the KLD. 

6 Future work and Conclusion 

Relying on the results of Cornebise et al. (2008), we have built new algorithms 
approximating the so-called optimal proposal kernel at a given time step of an 
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Figure 8: Visual evolution of the adapted kernel, for 3 ancestors in different regions 
for the Tobit model. 
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Figure 9: Comparison of the evolution of the KLD for the Gaussian experts and 
the Student's i-distributed expert over 5, 000 iterations of the algorithm. 



auxiliary particle filter by means of minimization of the KLD between the auxiliary 
target and instrumental distributions of the particle filter. More specifically, the 
algorithm fits a weighted mixture of integrated curved exponential distributions 
with logistic weights to the auxiliary target distribution by minimizing the KLD 
between the two using a Monte Carlo version of the online EM method proposed 
by Cappe and Moulines (2009). 

In addition, we have applied successfully this relatively simple algorithm to op- 
timal filtering in state-space models; indeed, running the stochastic approximation- 
based adaptation procedure for only a few iterations leveled off significantly the 
distribution of the total weight mass among the particles also for models exhibit- 
ing very strong nonlinearity. Thus, adding the adaptation step to an existing 
particle filter implementation implies only limited computational demands. 

A proof of convergence, along the lines mentioned in Section 4, of the algorithm 
is currently in progress. In addition, we investigate at present the possibility of 
extending the approach to comprise adaptation also of the adjustment multipli- 
ers. 
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